home *** CD-ROM | disk | FTP | other *** search
- Subject: v24i073: Public-domain replacement for compress programs, Part01/04
- Newsgroups: comp.sources.unix
- Approved: rsalz@uunet.UU.NET
- X-Checksum-Snefru: d84d6380 b5e3c68c cc5c4bff f34fbc20
-
- Submitted-by: Dan Bernstein <brnstnd@nyu.edu>
- Posting-number: Volume 24, Issue 73
- Archive-name: yabbawhap/part01
-
- [ The file PATENT gives a nice summary of the issues. --r$ ]
-
- yabba applies Y compression to its input; unyabba decompresses the
- result. whap applies AP compression to its input; unwhap decompresses
- the result. whap and unwhap run at about the same speed as UNIX compress
- and uncompress, which use LZW coding; yabba and unyabba are two to three
- times slower. AP and Y compression are typically 10-20% more effective
- than LZW compression in the same amount of memory. Y coding, unlike LZW
- coding and AP coding, is unpatented. It should be possible to use these
- programs on any reasonable C platform, though they were originally
- designed on a BSD UNIX system.
-
- #! /bin/sh
- # This is a shell archive. Remove anything before this line, then feed it
- # into a shell via "sh file" or similar. To overwrite existing files,
- # type "sh file -c".
- # The tool that generated this appeared in the comp.sources.unix newsgroup;
- # send mail to comp-sources-unix@uunet.uu.net if you want that tool.
- # Contents: README Makefile ycoding.4b
- # Wrapped by rsalz@litchi.bbn.com on Wed Mar 20 17:09:21 1991
- PATH=/bin:/usr/bin:/usr/ucb ; export PATH
- echo If this archive is complete, you will see the following message:
- echo ' "shar: End of archive 1 (of 4)."'
- if test -f 'README' -a "${1}" != "-c" ; then
- echo shar: Will not clobber existing file \"'README'\"
- else
- echo shar: Extracting \"'README'\" \(8502 characters\)
- sed "s/^X//" >'README' <<'END_OF_FILE'
- Xyabbawhap - Y and AP compression filters
- X
- Xyabba applies Y compression to its input; unyabba decompresses the
- Xresult. whap applies AP compression to its input; unwhap decompresses
- Xthe result. whap and unwhap run at about the same speed as UNIX compress
- Xand uncompress, which use LZW coding; yabba and unyabba are two to three
- Xtimes slower. AP and Y compression are typically 10-20% more effective
- Xthan LZW compression in the same amount of memory. Y coding, unlike LZW
- Xcoding and AP coding, is unpatented. It should be possible to use these
- Xprograms on any reasonable C platform, though they were originally
- Xdesigned on a BSD UNIX system.
- X
- Xyabbawhap version 1.00, March 19, 1991.
- XPlaced into the public domain by Daniel J. Bernstein.
- X
- XThanks to the gamma testers for their comments, criticism, and code:
- X
- X Eirik Fuller <eirik@theory.tn.cornell.edu>
- X Marc Andreessen <andreessen@uimrl7.mrl.uiuc.edu>
- X Loren J. Rittle <cs326ag@ux1.cso.uiuc.edu>
- X Jean-loup Gailly <jloup@chorus.fr>
- X Terje Malmedal <malmedal@dhmolde.no>
- X Lennart Augustsson <augustss@cs.chalmers.se>
- X John C. Schultz <schultz@halley.serc.3m.com>
- X Dave Gudeman <gudeman@cs.arizona.edu>
- X Colin Plumb <ccplumb@rose.waterloo.edu>
- X Ozan Yigit <oz@nexus.yorku.ca>
- X Benno Tietz <tietz@cs.bonn.edu>
- X Alexios Zavras <zvr@theseas.ntua.gr>
- X Hans Henrik Eriksen <hhe@ifi.uio.no>
- X Frank Wales <frank@grep.co.uk>
- X Paul A. Houle <pahsnsr@jupiter.nmt.edu>
- X Graham Thoal <gtoal@ed.ac.uk>
- X Robert Kelley <rjk@sequent.com>
- X
- X
- XOrganization of README:
- X
- X1. Files
- X2. Requirements
- X3. How to configure
- X4. How to compile
- X5. How to install
- X6. TODO list
- X
- X
- X1. Files:
- X
- XBLURB advertisement
- XCHANGES list of changes since first distributed version
- XREADME this file
- XFORMLETTER form letter to send to the author
- XPATENTS some notes on compression patents
- XQUESTIONS questions and answers about yabbawhap
- XFILES file list
- Xsysconf script to test for certain system features
- Xcheckconf.c tool to ease Makefile configuration
- XMakefile compilation commands
- Xtry script to test yabba and unyabba and compare to compress
- Xtryap script to test whap and unwhap and compare to compress
- Xtryapy combination of try and tryap
- XINSTALL script to install the programs
- Xyabba.1 man page for yabba and whap
- Xunyabba.1 man page for unyabba and unwhap
- Xhuptrie.h huptrie library
- Xbitout.h bit output library
- Xpercent.h percent library, computes 100a/b without overflow
- Xtexts.h various messages printed by the programs
- Xyw.c main code for yabba and whap
- Xunwhap.c main code for unwhap
- Xunyabba.c main code for unyabba
- Xbitout.c bit output code
- Xpercent.c percent code
- Xtexts.c texts code
- Xycoding.4b Y Coding, paper draft 4b
- Xycoding.uu yabba'd, uuencoded version of ycoding.4b
- X
- X
- X
- X2. Requirements
- X
- XYou should be able to adapt yabbawhap to practically any C platform
- X(with 8-bit characters). However, this package was designed on a UNIX
- Xsystem. The compressors do not take file names; they only act as
- Xfilters. The support files are also oriented towards UNIX.
- X
- Xyabbawhap has been reported to work on the following systems:
- X
- X Sun 4/280, SunOS 4.0.3 (1.00)
- X Sun 4/280, SunOS 4.0.3 (0.98)
- X Sun 4/280, SunOS 4.0.3, gcc (0.98)
- X Sun 4/330, SunOS 4.0.3 (0.95)
- X Sun 4/330, SunOS 4.0.3c (0.98)
- X SPARC?, SunOS 4.1 (0.95)
- X Sun 3/?, SunOS 4.1 (0.95)
- X Sun 3/160, SunOS 4.1 (1.00)
- X Sun 3/480, SunOS 4.1 (0.98)
- X SPARCStation SLC, SunOS 4.1 (0.95)
- X Sun 3/50, SunOS 4.1.1, gcc (0.98)
- X Sun 3/60, SunOS 4.1.1, gcc (0.98)
- X Sun 3/60, SunOS 4.1.1, gcc 1.39 (0.95)
- X SparcStation 2, SunOS 4.1.1 (0.98)
- X DECStation 5400, Ultrix 4.1 (1.00)
- X DECStation 5000/200, Ultrix 4.1 (0.95)
- X DECStation 5000/200, Ultrix 4.1, gcc (0.98)
- X DECStation 5000/200, Ultrix 4.1 (0.98)
- X may need -Olimit 2500 to compile with DEC's ANSI C compiler V2.10
- X DECStation 3100, Ultrix 4.0 (0.95)
- X DECStation 3100, Ultrix 4.0 (0.98)
- X DECSystem 5820, Ultrix 4.1 (0.98)
- X DECSystem 5820, Ultrix 4.1 (1.00)
- X VAX 11/780, BSD 4.3 (0.95)
- X VAX ?, BSD 4.3, gcc 1.39 (0.95)
- X Tek 4316, Utek 4.1, gcc 1.39 (0.95)
- X need -DBRAINDAMAGED
- X Sequent S811, Dynix 3.0.17 (0.95)
- X Sequent Symmetry, Dynix 3.0.17.9 (0.95)
- X Sequent ?, BSD 4.2? (0.95)
- X Apollo DN3500, DOMAIN/OS SR10.2, cc -A cpu,3000 -W0,-opt,2 (1.00)
- X Convex C1-XP, Convex UNIX 9.0, cc -pcc (1.00)
- X need -pcc for the new compiler; -UPTRS faster than -DPTRS
- X IBM RS/6000, AIX 3.1 (0.95)
- X HP 9000s300, HP/UX 7.0 (0.98)
- X -O rather than -O2, optimizer appears buggy anyway
- X NeXT, NeXT Mach 1.0 (0.98)
- X Astronautics ZS, ZSUnix 1.2 (1.00)
- X Amiga, AmigaOS 1.3.2 (0.95)
- X
- X(Under gcc, always use -O -fstrength-reduce in place of -O2 in CCOPTS.
- XDon't even bother with -finline-functions or the other function-call
- Xoptimizations.)
- X
- XIf your machine isn't in this list, and you get the programs working,
- X*please* send a note to me at brnstnd@nyu.edu on the Internet describing
- Xwhat you had to do to make the programs compile. (Of course, please also
- Xlet me know if you have trouble, or if you have comments, questions, or
- Xsuggestions.) I'd rather be flooded with reports and be able to compile
- Xa more comprehensive list than have no feedback because everyone assumes
- Xsomeone else has talked to me first. You can use FORMLETTER if you want.
- XThanks for being a good sport.
- X
- X
- X
- X3. How to configure
- X
- XFirst, run the sysconf shell script. It will try to figure out whether
- Xyou have bzero(), memset(), and a certain putchar() bug, and will modify
- XMakefile accordingly.
- X
- XNext, make checkconf. Then run checkconf to see a few facts about your
- Xcurrent configuration. You can give it options, like -DNODEMAX=21000,
- Xand it will instantly show you how that change will affect the size of
- Xthe programs if you add it to CCOPTS in the Makefile. It will also make
- Xsure that various constraints are met. checkconf -H shows a help screen.
- X
- XNext, read through the option descriptions in the Makefile, or print out
- Xa copy and peruse it at your leisure. You can configure yabba, unyabba,
- Xwhap, and unwhap in several different ways to change compression size,
- Xspeed, and power. No single configuration is right for every job.
- X
- XFinally, armed with checkconf and the option descriptions, decide how
- Xyou want to configure the programs. Change Makefile appropriately, and
- Xremake checkconf just in case you want to experiment with changes later.
- X
- XIf you want to get through configuration as quickly as possible, run
- X
- X % ./sysconf
- X
- Xand press return when it asks whether it should make and run checkconf.
- XBut don't complain to me about teething trouble if you haven't read
- Xthrough all of README and Makefile, as well as the checkconf output.
- X
- XTwo big caveats: 1. ZEROFILLED should be off on practically any non-UNIX
- Xoperating system. 2. If you increase NODEMAX, you probably want to set
- X-DNODENUM=65533 unless you know your recipients have compiled with the
- Xsame high value of NODEMAX.
- X
- X
- X
- X4. How to compile
- X
- XJust make. You'll get yabba and unyabba. If your optimizer dies, first
- Xtry defining -DOPTCANT5 in the Makefile. If that doesn't work, try
- X-DOPTCANT2. If that doesn't work, try -DOPTCANT1, or lower the
- Xoptimization level.
- X
- XTo test the programs, run the ``try'' shell script with a filename
- Xargument. % ./try yw.shar, for instance. try will give you times and
- Xresults for yabba, unyabba, compress, and uncompress on the file. (Note
- Xthat it does not test for nonexistent files or symbolic links.)
- X
- XIf you want to compile whap and unwhap, beware! AP coding is patented.
- X(Then again, LZW coding is patented, and people use compress all the
- Xtime.) You should understand the information in PATENTS first. Then if
- Xyou're curious to see how well AP coding can do, make AP. You can then
- Xrun the tryap and tryapy shell scripts the same way as try.
- X
- XAnother test you can run is to uudecode ycoding.uu and unyabba -m9999
- Xthe result, ycoding.Y. You should get a perfect copy of ycoding.4b.
- X
- X
- X5. How to install
- X
- XBy default, yabba and unyabba are installed in /usr/local/bin; whap and
- Xunwhap are installed in /usr/local/bin; yabba.1 and unyabba.1 are
- Xinstalled in /usr/man/man1. If you want to change these defaults, edit
- XINSTALL. Then run it from a root shell; it will check every action with
- Xyou before proceeding.
- X
- X
- X6. TODO list
- X
- X-E errs
- X-f filterfile
- X (Please don't bug me about having yabba take a filename argument
- X until you've seen what -f does.)
- X-F
- Xflexible -m? report size? (tnx dg)
- Xmake no-header and random-bits independent?
- Xbetter RESET defaults?
- Xput compression in function? will happen with -f
- Xrename as fwhap, fyabba, funwhap, fyabba? no
- Xuse array trie for top level or two?
- END_OF_FILE
- # end of 'README'
- fi
- if test -f 'Makefile' -a "${1}" != "-c" ; then
- echo shar: Will not clobber existing file \"'Makefile'\"
- else
- echo shar: Extracting \"'Makefile'\" \(9240 characters\)
- sed "s/^X//" >'Makefile' <<'END_OF_FILE'
- XCC=cc
- XCCOPTS=-O2 -DTYPE=int -DPTRS -DBZERO -DRANDINIT="getpid()"
- X
- X# These options would be best to compile ``large'' yabba and unyabba
- X# on a Sun or other generic 32-bit UNIX machine:
- X# CCOPTS=-O2 -DTYPE=int -DPTRS -DBZERO -DZEROFILLED -DRANDINIT="getpid()"
- X# On a System V machine, -DMEMZERO should be used instead of -DBZERO.
- X# If you want more speed, add -DMOD=131072.
- X# For large compressions, try -DNODEMAX=500000 -DMOD=1048576 -DNODENUM=65533.
- X# On a small machine, these options might be more appropriate:
- X# CCOPTS=-O2 -DTYPE=short -DOPTCANT5
- X# To see what the options mean and find out what other options are
- X# available, keep reading.
- X
- X# You should already have run the sysconf script (at least on any UNIX box)
- X# to test for system features. sysconf may have modified the CCOPTS line
- X# on line 2 of this Makefile.
- X
- X# To check your configuration decisions before compiling, type
- X# % make checkconf
- X# % ./checkconf
- X# or something equivalent on non-UNIX systems, and read through
- X# checkconf's output. You can give checkconf any -D or -U options to
- X# see what effect changes will have. On a UNIX machine, you can use
- X# the ``try'' shell script to test yabba and unyabba and see how they
- X# measure up against compress and uncompress in speed and effectiveness.
- X
- X# TYPE and PTRS have the biggest effect on how yabba and unyabba operate.
- X#
- X# TYPE is the type that yabba and unyabba will use for indexing. It
- X# should almost certainly be either short or int. If PTRS is defined,
- X# yabba (and unyabba and whap, but not unwhap) will use pointers rather
- X# than integers wherever possible. On almost all machines yabba runs
- X# faster with -DPTRS. More precisely:
- X#
- X# -UPTRS -DTYPE=short: Use shorts. This is ``small'' yabba and unyabba.
- X# -UPTRS -DTYPE=int: Use ints everywhere. On large machines this is
- X# usually quite a bit faster than the small version, but also uses
- X# twice as much memory.
- X# -DPTRS -DTYPE=short: small unwhap, but use pointers in yabba
- X# wherever possible. This typically doubles yabba's memory use.
- X# -DPTRS -DTYPE=int: Use ints everywhere, with pointers wherever
- X# possible. This is ``large'' yabba and unyabba. On large machines it is
- X# the fastest configuration, but uses twice as much memory as ``small.''
- X#
- X# If you want to generate very large-block compressions (see below)
- X# on a small machine where int is 16 bits, you may want to try compiling
- X# with -DPTRS -DTYPE=long.
- X
- X# NODEMAX and NODENUM control the compression block size.
- X#
- X# Any adaptive dictionary compressor has a block size. It can keep
- X# track of this much input, or this much output, or this many strings
- X# at once, and compress by looking for patterns in the input data.
- X# But eventually there's too much to keep track of, and the compressor
- X# has to stop adapting to the patterns.
- X#
- X# In the LZW-based UNIX ``compress'' program, for instance, the block
- X# size is stated as the number of bits that the coder can use to
- X# identify strings in its dictionary. Once it runs out of identifying
- X# numbers, it can't adapt any further.
- X#
- X# In this compressor, the natural block size is the size of input.
- X# NODENUM sets the number of bytes of input that yabba can adapt to at
- X# once. Past that it will stick to the current dictionary as long as the
- X# patterns seem useful, then clear the strings and start adapting all
- X# over again.
- X#
- X# The user can set NODENUM dynamically with -m. NODEMAX (minimum 512,
- X# default 65533---NODEMAX + 2 must fit into TYPE) sets the maximum
- X# possible value; several fixed arrays in yabba and unyabba are
- X# dimensioned with size NODEMAX. NODENUM defaults to NODEMAX.
- X#
- X# NODENUM and NODEMAX also control the size of input expected by unyabba.
- X# unyabba *must* be given the same -m parameter as yabba was given for
- X# compression. That means that if you compress on one machine where
- X# yabba has NODENUM set to 60000, you won't be able to decompress on
- X# another machine where unyabba has NODEMAX set to 20000. -m has the
- X# same effect on portability that -b does for compress (though -m is
- X# expressed in somewhat more tractable units).
- X
- X# MOD (default 65536) is the size of a hash table kept by yabba. It must
- X# be a power of two. It should be on a similar order of magnitude as
- X# NODEMAX---if it is much too small, the hash table will be too crowded,
- X# and if it is much too big, the table will waste memory.
- X
- X# BITBUFSIZE only affects yabba (at the moment). It is the size of two
- X# output buffers kept by the bit-packing coroutines. It defaults to 1000.
- X
- X# HASHTYPE (default TYPE) is the type used for storing hash values in
- X# yabba. It has little effect on whap memory use, so on large machines
- X# you may want to set -DHASHTYPE=int even for small whap. However, it
- X# does affect memory use in yabba and unyabba, so -DHASHTYPE=short may
- X# be useful. In any case, MOD - 1 must fit into unsigned HASHTYPE.
- X
- X# RESETNUM (default 8192) and RESETFUZZ (default 30, can be negative)
- X# control how yabba decides when to clear its dictionary. After NODENUM or
- X# the -m limit is reached, yabba checks every RESETNUM input characters
- X# whether compression is still worth it. RESETFUZZ has some vaguely
- X# defined effect on the meaning of ``worth it'': the higher RESETFUZZ
- X# is, the longer yabba holds out before clearing the old patterns. The user
- X# can change RESETFUZZ and RESETNUM dynamically, so don't worry about them
- X# too much.
- X
- X# Under -r, yabba needs some reasonably random value from the environment.
- X# If you define an integer RANDINIT, yabba will initialize its generator
- X# based on that value. Otherwise it has to stick to a default sequence,
- X# which does not add as much security. (If you can't think up a good
- X# definition for RANDINIT, open up your phone book and pick a number.)
- X
- X# Many operating systems have brain-damaged putc(): if you put a
- X# (perfectly valid) 255 byte, putc() will return EOF, indicating an error.
- X# This is the case under Ultrix 3.1 and Convex UNIX 8.0, for instance.
- X# You MUST define -DBRAINDAMAGED if sysconf tells you to; otherwise yabba
- X# and unyabba will crash mysteriously. Also complain to your vendor.
- X
- X# If your compiler blows up on yw.c, you may want to set -DOPTCANT5.
- X# This will change certain heavily unrolled loops to lightly unrolled.
- X# If that still doesn't help, try -DOPTCANT2; then -DOPTCANT1.
- X
- X# If the internal representation of the NULL pointer on your machine is
- X# 0, you can make yabba run slightly faster with -DBZERO or -DMEMZERO.
- X# (checkconf will tell you if this is true.) If you have bzero(), use
- X# -DBZERO. If you have ANSI memset(), use -DMEMZERO. If you have
- X# neither, you're out of luck. MEMZERO is ignored under -DBZERO.
- X
- X# If you have a machine with particularly fast memory access (perhaps
- X# certain microcomputers), you might try defining -DHASHPTRS. This will
- X# use some more memory but may run faster. I have not seen any machines
- X# where -DHASHPTRS helps.
- X
- X# Finally, you should set -DZEROFILLED if your operating system
- X# guarantees that static arrays are initialized to zeros. Don't set it
- X# under -DPTRS if NULL pointers don't have internal representation 0.
- X# ZEROFILLED just makes yabba start up a bit faster. (checkconf warns you
- X# if it notices a non-zero-filled array, but its test isn't guaranteed.)
- X# UNIX systems generally guarantee -DZEROFILLED.
- X
- X# All of the above comments apply equally to whap as to yabba except
- X# where otherwise noted.
- X
- Xdefault: all
- X
- XAP: whap unwhap
- X
- Xall: yabba unyabba
- X
- Xcheckconf: checkconf.c Makefile
- X $(CC) $(CCOPTS) -o checkconf checkconf.c
- X
- Xyabba: yabba.o bitout.o percent.o ytexts.o Makefile
- X $(CC) $(CCOPTS) -o yabba yabba.o bitout.o percent.o ytexts.o
- X
- Xunyabba: unyabba.o percent.o ytexts.o Makefile
- X $(CC) $(CCOPTS) -o unyabba unyabba.o percent.o ytexts.o
- X
- Xwhap: whap.o bitout.o percent.o wtexts.o Makefile
- X @echo 'Warning! If you use AP coding except for instruction and amusement,'
- X @echo 'you may be infringing on a patent.'
- X @echo 'If you understand this, press return:'
- X @read foo;
- X $(CC) $(CCOPTS) -DWHAP -o whap whap.o bitout.o percent.o wtexts.o
- X
- Xunwhap: unwhap.o percent.o wtexts.o Makefile
- X $(CC) $(CCOPTS) -DWHAP -o unwhap unwhap.o percent.o wtexts.o
- X
- Xbitout.o: bitout.c bitout.h Makefile
- X $(CC) $(CCOPTS) -c bitout.c
- X
- Xpercent.o: percent.c percent.h Makefile
- X $(CC) $(CCOPTS) -c percent.c
- X
- Xytexts.o: texts.c texts.h Makefile
- X $(CC) $(CCOPTS) -c texts.c
- X mv texts.o ytexts.o
- X
- Xwtexts.o: texts.c texts.h Makefile
- X $(CC) $(CCOPTS) -DWHAP -c texts.c
- X mv texts.o wtexts.o
- X
- Xyabba.o: yw.c bitout.h percent.h texts.h huptrie.h Makefile
- X $(CC) $(CCOPTS) -c yw.c
- X mv yw.o yabba.o
- X
- Xwhap.o: yw.c bitout.h percent.h texts.h huptrie.h Makefile
- X $(CC) $(CCOPTS) -DWHAP -c yw.c
- X mv yw.o whap.o
- X
- Xunyabba.o: unyabba.c bitout.h percent.h texts.h huptrie.h Makefile
- X $(CC) $(CCOPTS) -c unyabba.c
- X
- Xunwhap.o: unwhap.c percent.h texts.h Makefile
- X $(CC) $(CCOPTS) -DWHAP -c unwhap.c
- X
- Xinstall:
- X @echo 'Run INSTALL in a root shell.'
- X exit 1
- X
- Xclean:
- X @echo 'Why do you want to make clean, anyway?'
- X @echo 'If you changed the Makefile, it knows it should remake.'
- X @echo 'If you want to save space, do make shar and then remove everything.'
- X rm -f unwhap.o unyabba.o whap.o yabba.o yw.o texts.o bitout.o percent.o wtexts.o ytexts.o checkconf whap unwhap yabba unyabba
- X
- Xshar:
- X shar `cat FILES` > yw.shar
- X chmod 400 yw.shar
- END_OF_FILE
- if test 9240 -ne `wc -c <'Makefile'`; then
- echo shar: \"'Makefile'\" unpacked with wrong size!
- fi
- # end of 'Makefile'
- fi
- if test -f 'ycoding.4b' -a "${1}" != "-c" ; then
- echo shar: Will not clobber existing file \"'ycoding.4b'\"
- else
- echo shar: Extracting \"'ycoding.4b'\" \(31910 characters\)
- sed "s/^X//" >'ycoding.4b' <<'END_OF_FILE'
- XY coding
- X
- XDaniel J. Bernstein
- X
- XCopyright 1991. All rights reserved.
- XDraft 4b, March 19, 1991.
- X
- XThis is a draft. That means it's supposed to be unreadable, misleading,
- Xboring, useless, and generally wrong. Any deviations from the norm are
- Xaccidents---happy accidents, but accidents nonetheless. End of warning.
- X
- X
- X----- 1. Introduction
- X
- X--- LZW coding
- X
- XFix an alphabet A, and take a string I over that alphabet. Construct a
- X``dictionary'' D---a one-to-one mapping from strings to integers---as
- Xfollows:
- X
- X 0. Start by mapping each symbol of A to a unique integer.
- X 1. Find the longest prefix p of I contained in D. (Rather, contained
- X in the domain of D. It is convenient to ignore this distinction.)
- X 2. Take the next symbol c after p in I.
- X 3. Add pc to the dictionary, mapping to another unique integer.
- X 4. Strip p from the front of I, so that I begins with c.
- X 5. Repeat from step 1 until I is empty (i.e., until step 2 fails).
- X
- XFor example, say A is the set abcdefghijklmnopqrstuvwxyz, and say I is
- Xthe string yabbadabbadabbadoo. D starts with all single characters from
- XA. Now the longest match between I and D is I's first character, y; and
- Xthe charater after that is a. So we add ya to the dictionary and strip y
- Xfrom the front of I.
- X
- XNow I is abbadabbadabbadoo, and D has all single characters plus ya. The
- Xlongest match between D and I is a, so we add ab to the dictionary and
- Xremove the a. We continue in this manner until I is empty:
- X
- X I match add new dictionary
- X yabbadabbadabbadoo y ya ya (plus all single characters)
- X abbadabbadabbadoo a ab ya ab
- X bbadabbadabbadoo b bb ya ab bb
- X badabbadabbadoo b ba ya ab bb ba
- X adabbadabbadoo a ad ya ab bb ba ad
- X dabbadabbadoo d da ya ab bb ba ad da
- X abbadabbadoo ab abb ya ab bb ba ad da abb
- X badabbadoo ba bad ya ab bb ba ad da abb bad
- X dabbadoo da dab ya ab bb ba ad da abb bad dab
- X bbadoo bb bba ya ab bb ba ad da abb bad dab bba
- X adoo ad ado ya ab bb ba ad da abb bad dab bba ado
- X oo o oo ya ab bb ba ad da abb bad dab bba ado oo
- X o o
- X
- XWhile we construct the dictionary we can output the value of each match
- Xunder the dictionary. This produces a sequence of numbers which, as we
- Xwill see below, is sufficient to reconstruct the original string. This
- Xmapping from strings to sequences of numbers is called LZW coding.
- X
- XTypically the numbers in the dictionary are chosen sequentially, from 0
- Xto |A| - 1 for the initial single-character strings and then from |A| on
- Xup. In the above example, the matches are y a b b a d ab ba da bb ad o o,
- Xwhich have numbers 24 0 1 1 0 3 27 29 31 28 30 14 14. That sequence is
- Xthe result of yabbadabbadabbadoo under LZW coding.
- X
- X
- X--- LZW decoding
- X
- XHow do we reconstruct the original string if we're given the coded
- Xsequence? The secret is to reconstruct the dictionary.
- X
- X 0. Start by mapping each symbol of A to a unique integer. Set I to the
- X empty string. Read a number from the input, and set p to the
- X corresponding single-character string.
- X 1. Append p to I.
- X 2. Read a number from the input, and let c be the first character of
- X the corresponding string in the dictionary.
- X 3. Add pc to the dictionary, mapping to the next unique integer.
- X 4. Set p to the dictionary string corresponding to the number read.
- X 5. Repeat from step 1 until end-of-input (i.e., until step 2 fails).
- X
- XFor example, take the input sequence 24 0 1 1 0 3 27 29 31 28 30 14 14.
- XD starts with all single characters from A; p starts as the single
- Xcharacter y; and I starts empty.
- X
- XNow p is appended to I, so that I contains y. The 0 in the input means
- Xthat the next character of I is an a; and ya is added to the dictionary.
- Xp is then set to a.
- X
- XNext, p is appended to I, so that I contains ya. The 1 in the input
- Xmeans that the next character of I is b; and ab is added to the
- Xdictionary. p is then set to b. We continue this way through the entire
- Xinput:
- X
- X input c added new p I
- X 24 y y
- X 0 a (ya,26) a ya
- X 1 b (ab,27) b yab
- X 1 b (bb,28) b yabb
- X 0 a (ba,29) a yabba
- X 3 d (ad,30) d yabbad
- X 27 *a* (da,31) ab yabbadab 27 is ab, so *a* is the 1st char
- X 29 _b_ (abb,32) ba yabbadabba 29 is ba, so _b_ is the 1st char
- X 31 d (bad,33) da yabbadabbada
- X 28 b (dab,34) bb yabbadabbadabb
- X 30 a (bba,35) ad yabbadabbadabbad
- X 14 o (ado,36) o yabbadabbadabbado
- X 14 o (oo,37) o yabbadabbadabbadoo
- X
- XNotice the slightly twisted statement of steps 2 through 4. p is set to
- Xthe dictionary string corresponding to the number read from the input,
- Xbut first c is set to the first character of that string, and the old pc
- Xis added to the dictionary. This roundabout presentation is necessary
- Xbecause the number on the input may be the number of the very last
- Xdictionary string added---and that last string depends on the first
- Xcharacter of the current string. This overlap is always safe but is one
- Xof the first problems to look out for in a new implementation.
- X
- X
- X--- So who cares about this LZW stuff anyway?
- X
- XWhat's the point of LZW coding? When the input string has lots of the
- Xsame characters or sequences of characters, the dictionary will rapidly
- Xgrow to include those sequences. Then a single number in the output
- Xsequence might represent several characters of input, and will use less
- Xspace.
- X
- XFor example, take the simplest natural encoding of a string over our
- Xlowercase alphabet A. There are 26 symbols, so we can represent each
- Xwith 5 bits. Our string yabbadabbadabbadoo has 18 characters and takes
- X90 bits under this encoding.
- X
- XOn the other hand, the string after encoding has just 13 numbers:
- X24 0 1 1 0 3 27 29 31 28 30 14 14. For these values there are 26 27 28
- X29 30 31 32 33 34 35 36 37 38 choices respectively, so they take 5 5 5 5
- X5 5 5 6 6 6 6 6 6 bits in the natural way, for a total of just 71 bits.
- X
- XMost strings that come up in practice have similar redundancy, and LZW
- Xcoding will produce an output that takes less space than its input. This
- Xis why LZW coding is usually called LZW compression. It often reduces
- Xlonger strings to 50% or even 30% of their original size, so that they
- Xtake a fraction as much disk space and communication time as they would
- Xin their uncompressed form.
- X
- X
- X--- A bit of jargon
- X
- XMany audio and video compression methods throw away some information.
- XThey don't reconstruct exactly the original pictures and sounds, but
- Xnobody really notices. These are called inexact, or sometimes lossy,
- Xcompressors. In contrast, LZW always lets you get back exactly the
- Xoriginal string. It is an exact, or lossless, compressor.
- X
- XAny compressor that splits its input into substrings and codes the
- Xstrings with a dictionary is called (logically enough) a dictionary
- Xcompressor.
- X
- XA substring of an input string---particularly a substring that is coded
- Xas a single output number---is called a ``phrase.'' In LZW, one string
- Xis added to the dictionary per phrase.
- X
- XSome compression methods use fixed tables: ``the'' becomes a special,
- Xsingle character, ``of'' becomes another, etc. They're called static, or
- Xnonadaptive, compressors. Others, like LZW, build their dictionaries
- Xdynamically to adapt to any input; LZW is an example of an adaptive
- Xcompressor. Somewhere in between are compressors that make two passes
- Xover the input, building a dictionary the first time through and then
- Xproducing output the second time through. They're called semiadaptive.
- X
- XOne useless theoretical property of LZW is that it is universal. This
- Xmeans that on a certain wide class of inputs it will give as close as
- Xyou want to the best possible compression of any method. The longer your
- Xstrings are, the closer it comes to optimal. Universality doesn't make
- Xone whit of difference in practice---to get within 1% of optimal with
- XLZW you'd have to start compressing the planet---but it generally
- Xindicates that a method is both simple and trustworthy. Non-universal
- Xcompressors are usually much more complicated, and will fail miserably
- Xon inputs they weren't designed specifically to handle.
- X
- X
- X
- X----- 2. Implementing LZW
- X
- X--- Encoding
- X
- XThe most natural way to find the longest match between I and strings in
- XD is one character at a time. Start with p as the first character of I
- X(or as the null string if that is more convenient). Read the next
- Xcharacter, c. If pc is in the dictionary, set p to pc and repeat.
- XOtherwise p and c are set.
- X
- XSo the dictionary has to support two basic operations: searching for pc
- Xif p is known to be there already, and adding pc if it is not there.
- XMost implementors use some form of trie---array trie, list trie, hash
- Xtrie, huptrie---for this job. The array trie, as in Woods' ``hyper-LZ''
- X[9], offers extreme speed at the expense of |A| (typically 256) words
- Xof memory for each string in the trie. The huptrie and straight hash
- Xtrie use only a small amount of memory per string but still allow
- Xreasonably fast searches in the average case. We will not consider these
- Xstructures in further detail.
- X
- XIn any case at most a fixed number of operations happen for every input
- Xcharacter, so LZW coding takes linear time no matter how long the input
- Xis. Unfortunately, computers don't have infinite memory. Implementations
- Xtypically limit the dictionary to some size, usually between 1024 and
- X65536 strings. When the dictionary reaches that size it can either
- Xremain fixed for the rest of the input, or it can be cleared and start
- Xfrom single characters again. The second strategy effectively breaks the
- Xinput into ``blocks'' with independent dictionaries. If the input
- X``feels'' different in different blocks, blocking will produce good
- Xresults, because it will weed useless strings out of the dictionary.
- X
- XThe ``compress'' program [8] introduced an important blocking variant.
- XInstead of clearing the dictionary as soon as it reaches its maximum
- Xsize, compress periodically checks how well it is doing. It will only
- Xclear the dictionary when its compression ratio deteriorates. This way
- Xthe blocks adapt to the input. Note that all of these blocking
- Xtechniques require space for a special output code to clear the
- Xdictionary.
- X
- XAnother variant is LRU replacement: the least-recently-used strings are
- Xremoved from the dictionary at some opportune time. This provides a
- Xsmoother transition between blocks than clearing the dictionary.
- XUnfortunately, it is difficult to find good heuristics for when to
- Xremove what strings. Blocking is still more of an art than a science.
- X
- X
- X--- Preparing for encryption
- X
- XIt is very useful to compress a message before encrypting it, as
- Xcompression removes much of the statistical regularity of straight text.
- XHowever, the usual coding leaves a lot of redundancy. If the dictionary
- Xhas 33 strings, for example, and 6 bits are used to represent a choice
- Xamong those strings, then the high bit will almost always be 0. These
- Xhigh bits come in predictable locations, so an attacker can almost
- Xalways figure out the key given a long enough stretch of compressed
- Xtext.
- X
- XTo prevent this, the compressor should introduce random bits as follows:
- XGiven a choice between 0 and 40, 0 through 22 are encoded as either
- Xthemselves or from 41 through 63. The choice should be made with a
- Xhigh-delay random number generator such as a subtractive generator.
- X23 through 40 are always encoded as themselves. This method greatly
- Xreduces the amount of extra information available per bit without
- Xappreciably slowing down the coding. Because random bits are used at
- Xunpredictable times, an attacker cannot easily take advantage of the
- Xdeterminism of the pseudo-random generator.
- X
- XMost implementations also add a recognizable header to compressed text.
- XThis header should be removed before encryption.
- X
- X
- X--- Decoding
- X
- XLZW decoding is even easier than encoding: the dictionary does not have
- Xto support searching. The easiest (and generally fastest) method is to
- Xkeep I in memory as it is reconstructed, and to keep track of which
- Xsubstring of I a given dictionary number corresponds to. To add pc to
- Xthe dictionary means to add the pair (pointer to start of pc within I,
- Xlength of pc) at the right spot in an array indexed by dictionary
- Xnumbers. There are methods which take slightly less memory than this,
- Xbut they are slower.
- X
- X
- X
- X----- 3. MW and AP coding
- X
- X--- MW: Adapting better
- X
- XLZW adapts to its input relatively slowly: strings in the dictionary
- Xonly get one character longer at a time. If LZW is fed a million
- Xconsecutive x's, its longest dictionary string will only have around
- X1400 x's, and it will only achieve around 1000:1 compression. Is there
- Xany better way for it to notice the redundancy?
- X
- XThe answer is yes. Instead of adding p plus one character of the next
- Xphrase to the dictionary, we can add p plus the entire next phrase to
- Xthe dictionary. This defines MW coding. For example:
- X
- X I match added to dictionary
- X yabbadabbadabbadoo y
- X abbadabbadabbadoo a ya (concatenation of last two matches)
- X bbadabbadabbadoo b ab
- X badabbadabbadoo b bb
- X adabbadabbadoo a ba
- X dabbadabbadoo d ad
- X abbadabbadoo ab dab
- X badabbadoo ba abba
- X dabbadoo dab badab
- X badoo ba dabba
- X doo d bad
- X oo o do
- X o o
- X
- XEven such a short example shows how quickly MW begins to adapt. By the
- Xfifteenth character ``dabba'' is already established as a dictionary
- Xphrase. On the other hand, MW will miss shorter phrases where there is
- Xless redundancy, so it generally performs only somewhat better than LZW
- Xin practice. In this example it outputs 13 numbers, just like LZW. But
- Xgiven a million x's it will end up with a dictionary string of length
- Xaround half a million, and it will output just 30 bytes.
- X
- X
- X
- X
- X--- Problems of MW
- X
- XMW lacks some properties of LZW that we don't realize are so helpful
- Xuntil they are taken away. Most importantly, not every prefix of a
- Xdictionary string is in the dictionary. This means that the
- Xcharacter-at-a-time search method we saw for LZW doesn't work. Instead,
- Xevery prefix of a string in the dictionary must be added to the trie,
- Xand every node in the trie must be given a tag saying whether it is in
- Xthe dictionary or not. Furthermore, finding the longest string may
- Xrequire backtracking: if the dictionary contains xxxx and xxxxxxxx, we
- Xdon't know until we've read to the eighth character of xxxxxxxy that we
- Xhave to choose the shorter string. This seems to imply that MW coding
- Xrequires backtracking and hence is fundamentally slower than LZW coding.
- X(Decoding can be made reasonably fast, though.)
- X
- XAnother problem is that a string may be added to the dictionary twice.
- XThis rules out certain implementations and requires extra care in
- Xothers. It also makes MW a little less safe than LZW before encryption.
- X
- X
- X--- AP: Adapting faster
- X
- XThere is a natural way to preserve the good adaptation of MW while
- Xeliminating its need for backtracking: instead of just concatenating the
- Xlast two phrases and putting the result into the dictionary, put all
- Xprefixes of the concatenation into the dictionary. (More precisely, if S
- Xand T are the last two matches, add St to the dictionary for every
- Xnonempty prefix t of T, including T itself.) This defines AP coding. For
- Xexample:
- X
- X I match added to dictionary
- X yabbadabbadabbadoo y
- X abbadabbadabbadoo a ya
- X bbadabbadabbadoo b ab
- X badabbadabbadoo b bb
- X adabbadabbadoo a ba
- X dabbadabbadoo d ad
- X abbadabbadoo ab da, dab (d + all prefixes of ab)
- X badabbadoo ba abb, abba (ab + all prefixes of ba)
- X dabbadoo dab bad, bada, badab
- X badoo ba dabb, dabba
- X doo d bad
- X oo o do
- X o o
- X
- XSince AP adds more strings to the dictionary, it takes more bits to
- Xrepresent a choice. However, it does provide a fuller range of matches
- Xfor the input string, and in practice it achieves slightly better
- Xcompression than MW in quite a bit less time.
- X
- X
- X----- 4. Y coding
- X
- X--- Completing the square
- X
- XLZW adds one dictionary string per phrase and increments strings by one
- Xcharacter at a time. MW adds one dictionary string per phrase and
- Xincrements strings by several characters at a time. AP adds one
- Xdictionary string per input character and increments strings by several
- Xcharacters at a time. These properties define three broad classes of
- Xmethods and point naturally to a fourth: coders that add one dictionary
- Xstring per input character and increment strings by one character at a
- Xtime. An example of such a method is Y coding. (It is worth noting at
- Xthis point that LZ77 variants are characterized by adding several
- Xdictionary strings per input character.)
- X
- X
- X--- The incomprehensible definition
- X
- XY coding is defined as follows: The dictionary starts with all single
- Xcharacters. One string pc starting at each input position is added to
- Xthe dictionary, where p is in the dictionary before that character and
- Xpc is not.
- X
- XTo put it differently, ``yowie'' appears in the dictionary as soon as
- Xthe regular expression yo.*yow.*yowi.*yowie matches the text. It is
- Xadded at the final e. So yabbadabbadabbadoo leads to the dictionary ya,
- Xab, bb, ba, ad, da, abb, bba, ada, dab, abba, bbad, bado, ado, oo.
- X
- XWe haven't defined the coding yet. While we build the dictionary, we
- Xkeep track of a current longest match (initially empty). Right after
- Xreading an input character and before integrating it into the
- Xdictionary, we see whether the longest match plus that character is
- Xstill in the dictionary *constructed before the start of that longest
- Xmatch*. If so, we use that as the new longest match, and proceed.
- XOtherwise, we output the number of the longest match, and set the
- Xlongest match to just that new character.
- X
- XTo decode, we read these matches, one by one. We take each one as a
- Xsequence of characters as defined by the dictionary, and output that
- Xsequence. Meanwhile we take each character and feed it as input to the
- Xdictionary-building routine.
- X
- X
- X--- A comprehensible example
- X
- XWe can run through the string keeping track of dictionary matches, as
- Xfollows:
- X
- X I add current matches
- X abcabcabcabcabcabcabcx a
- X bcabcabcabcabcabcabcx ab b
- X cabcabcabcabcabcabcx bc c
- X abcabcabcabcabcabcx ca a
- X bcabcabcabcabcabcx ab b
- X cabcabcabcabcabcx abc bc c
- X abcabcabcabcabcx bca ca a
- X bcabcabcabcabcx cab ab b
- X cabcabcabcabcx _____ abc bc c
- X abcabcabcabcx abca / \ \___ bca ca a
- X bcabcabcabcx bcab cab ab b \____ \_/
- X cabcabcabcx cabc abc bc c \__/
- X abcabcabcx abca bca ca a
- X bcabcabcx abcab bcab cab ab b
- X cabcabcx bcabc cabc abc bc c
- X abcabcx cabca abca bca ca a
- X bcabcx ,-----------------abcab bcab cab ab b
- X cabcx abcabc `--bcabc cabc abc bc c
- X abcx bcabca cabca abca bca ca a
- X bcx cabcab abcab bcab cab ab b
- X cx abcabc bcabc cabc abc bc c
- X x abcabcx x
- X bcabcx
- X cabcx
- X abcx
- X bcx
- X cx
- X
- XThe strings on the right side are the current matches-so-far after each
- Xinput character. We start with no matches. When we read a character, we
- Xappend it to each match so far; if the results are in the dictionary, we
- Xmake them the new matches, and add the character as a match by itself.
- XIf any of them are not in the dictionary we take them out of the list
- Xand add them to the dictionary.
- X
- XBefore reading the fifth c, for example, the matches are bcab, cab, ab,
- Xand b. We append c to each one: bcabc doesn't match, so we add it to the
- Xdictionary. The rest are still in the dictionary, so our new list is
- Xcabc, abc, bc, and a lone c. And so on.
- X
- XWhen the x is read, the current list is abcab bcab cab ab b. Now none of
- Xabcabx bcabx cabx abx bx are in the dictionary, so we add all of them,
- Xand the list becomes just a single x.
- X
- X
- X--- The crucial observation
- X
- XIt is not clear so far that Y is a linear-time algorithm: each input
- Xcharacter demands additions to several matches. However, *every
- Xsubstring of a dictionary string is in the dictionary*. This is clear
- Xfrom the regular-expression characterization of dictionary strings: if
- Xyo.*yow.*yowi.*yowie matches the text, then ow.*owi.*owie (for example)
- Xcertainly does as well.
- X
- XSay the current match list is wonka onka nka ka a, and we read the
- Xcharacter x. The next list has to be one of the following:
- X
- X wonkax onkax nkax kax ax x
- X onkax nkax kax ax x
- X nkax kax ax x
- X kax ax x
- X ax x
- X x
- X
- XIf onkax matches, for example, then nkax must match as well, and so must
- Xkax and ax and x.
- X
- XSo the match list will always consist of one string and its suffixes.
- XHence we can keep track of just the longest string in the match list.
- XFor example, with the input string oompaoompapaoompaoompapa:
- X
- X input test add match
- X o o
- X o oo oo o
- X m om om m
- X p mp mp p
- X a pa pa a
- X o ao ao o
- X o oo oo
- X m oom oom om
- X p omp omp mp
- X a mpa mpa pa
- X p pap pap
- X ap p
- X a pa pa
- X o pao pao ao
- X o aoo aoo oo
- X m oom oom
- X p oomp oomp omp
- X a ompa ompa mpa
- X o mpao mpao
- X pao ao
- X o aoo aoo
- X m aoom aoom oom
- X p oomp oomp
- X a oompa oompa ompa
- X p ompap ompap
- X mpap pap
- X a papa papa
- X apa pa
- X
- X
- X--- A comprehensible definition
- X
- XHere is another definition of how to construct the Y dictionary, based
- Xon the chart above.
- X
- X 0. Start with the dictionary mapping every character of A to a unique
- X integer. Set m to the empty string.
- X 1. Append the next character c of I to m.
- X 2. If m is not in the dictionary, then add m to the dictionary, remove
- X the first character of m, and repeat this step.
- X 3. Repeat from step 1 until the input is exhausted.
- X
- XWe must be careful in defining output: the output is not synchronized
- Xwith the dictionary additions, and we must be careful not to have the
- Xlongest output match overlap itself. To this end we split the dictionary
- Xinto two parts, the first part safe to use for the output, the second
- Xpart not safe.
- X
- X 0. Start with S mapping each single character to a unique integer;
- X T empty; m empty; and o empty. (S and T are the two parts of the
- X dictionary.)
- X 1. Read a character c. If oc is in S, set o to oc; otherwise output
- X S(o), set o to c, add T to S, and remove everything from T.
- X 2. While mc is not in S or T, add mc to T (mapping to the next
- X available integer), and chop off the first character of m.
- X 3. After m is short enough that mc is in the dictionary, set m to mc.
- X 4. Repeat as long as there are enough input characters (i.e., until
- X step 1 fails).
- X 5. Output S(o) and quit.
- X
- XHere's how to decode:
- X
- X 0. Initialize (D,m) as above.
- X 1. Read D(o) from the input and take the inverse under D to find o.
- X 2. As long as o is not the empty string, find the first character c of
- X o, and update (D,m) as above. Also output c and chop it off from
- X the front of o.
- X 3. Repeat from step 1 until the input is exhausted.
- X
- XThe coding only requires two fast operations on strings in the
- Xdictionary: testing whether sc is there if s's position is known, and
- Xfinding s's position given cs's position. The decoding only requires the
- Xsame operations plus finding the first character of a string given its
- Xspot in the dictionary.
- X
- X
- X--- Some properties of Y coding
- X
- XWe propose ``per-phrase dictionary'' to describe the dictionaries
- Xconstructed by LZW and MW, ``per-character dictionary'' for AP and Y.
- XWe also propose ``phrase-increment'' to describe MW and AP,
- X``character-increment'' for LZW and Y. More precisely, a per-phrase
- Xdictionary is one which contains one string per output phrase, while a
- Xper-character dictionary contains one string per input character. Each
- Xstring in a character-increment dictionary is exactly one character
- Xlonger than a string added at a previous step; in contrast, a string in
- Xa phrase-increment dictionary can be much longer than any of its
- Xprefixes formerly in the dictionary. According to this terminology,
- XY is an exact character-increment per-character dictionary compressor.
- X
- XY has a similar feel to LZJ, which has a dictionary consisting of all
- Xunique substrings up to a certain length in a block of the input.
- XHowever, Y can adapt much more effectively to redundant text.
- X
- X
- X
- X----- 5. Results
- X
- XThe author has implemented Y coding [2] along with, for instruction and
- Xamusement, AP coding. In this section we survey various implementation
- Xissues and compare the effectiveness of the coding methods explained
- Xhere.
- X
- XThe author's implementation, yabbawhap, uses a huptrie [3] for the
- Xdictionary. yabba keeps track of the separate dictionaries S and T for Y
- Xcoding by remembering the highest position within D that is ``safe''
- Xinside S; all higher positions are in T.
- X
- Xyabbawhap uses compress's strategy of adaptive blocking. The user may
- Xvary most aspects of compression dynamically, including two parameters
- Xof the heuristic used to control blocking.
- X
- XHere are results of these methods on the Bell-Witten-Cleary text corpus,
- Xalong with a highly redundant 180489-byte ``makelog'' added by the
- Xauthor to show an extreme case.
- X
- X __bib__book1__book2____geo_makelog___news___obj1___obj2_paper1_paper2_
- XZ12 54094 394419 325147 78026 34757 230765 16528 160099 29433 40881
- XZ14 46817 357387 281354 77696 25647 202594 14048 138521 25077 37196
- XZ16__46528_332056_250759__77777___25647_182121__14048_128659__25077__36161_
- XMW___41040_336793_230862__80296____8299_168652__13273_109266__22749__34234_
- XAP2 47056 389702 297205 84582 20925 219665 13824 134547 26937 39415
- XAP6 40770 338046 261270 79471 14762 190502 13824 123323 22413 34637
- XAP3__40311_322178_228978__80106____8821_167896__13825_113296__22414__33320_
- XY2 46882 363339 287110 80817 28159 212617 13858 141783 26131 38037
- XY6 40874 320622 256578 76275 21220 185097 13858 125900 22452 33671
- XY3___40456_306813_229851__76695___14411_168287__13859_114323__22453__32733_
- X
- X paper3_paper4_paper5_paper6_____pic__progc__progl__progp__trans_
- XZ12 23567 7091 6670 22333 66236 21825 31987 22936 46185
- XZ14 22163 6957 6580 18695 63277 19143 27116 19209 39605
- XZ16__22163___6957___6580__18695___62215__19143__27148__19209__38240_
- XMW___21495___6697___6169__16899___65102__16976__22223__15095__27742_
- XAP2 22293 6595 6146 19770 74061 18868 27191 17962 38781
- XAP6 20869 6595 6146 16786 68349 16691 22716 15138 30415
- XAP3__20870___6596___6147__16787___67980__16692__22451__15139__28056_
- XY2 21609 6443 6033 19418 71952 18897 27607 19429 40444
- XY6 20355 6443 6033 16677 66117 17063 23625 16616 33026
- XY3___20356___6444___6034__16678___65377__17064__23512__16617__31300_
- X
- XZ12 is LZW with 12 bits (i.e., a dictionary of at most 4096 strings),
- Xusing compress -b12. MW is MW using the author's squeeze program [4],
- Xwith a dictionary of at most 300000 strings (roughly equivalent to
- X1500000 input characters). AP2 is AP with an input block size of 21000,
- Xusing whap -m21000; AP6 has a block size of 65533, and AP3 has a block
- Xsize of 300000. Y is like AP but using yabba.
- X
- XY is notably effective upon book1 and news.
- X
- XXXX what else do people want to see in this section?
- X
- X
- X
- X----- 6. Conclusion
- X
- X--- Life goes on
- X
- XY coding is not much more complex than LZW coding and produces generally
- Xbetter results. It is one of the most effective non-Huffman-based
- XLZ78-style compressors.
- X
- X
- X--- How to achieve fame and fortune
- X
- XIt may be possible to implement Y so that the decoding does not have to
- Xdo all the work of the coding. In particular, three-fourths of the
- Xdictionary strings are typically not used at all; they should not have
- Xto be handled during decoding. Even if Y cannot be sped up, there is
- Xprobably some character-increment per-character dictionary compressor
- Xthat achieves similar results to Y but runs as quickly as LZW or AP.
- XThis is a area for future research.
- X
- XWe have not discussed different ways to code the output numbers. Huffman
- Xcoding and arithmetic coding both produce quite respectable improvements
- Xover and above the compression of the basic Y method. Little is known
- Xabout the best way to code a sequence from a dynamically expanding
- Xalphabet; it is a bit counterintuitive that semiadaptive Huffman coding
- Xupon an output Y sequence can produce worse results than adaptive
- XHuffman coding.
- X
- XIt would be interesting to compare a straight modeller with Y plus a
- XHuffman coder to see which produces better results.
- X
- X
- X--- Acknowledgments
- X
- XThe author would like to thank James Woods for his support and
- Xencouragement, as well as for many copies of papers; Richard Stallman,
- Xfor motivating the author to find Y coding; and Herbert J. Bernstein for
- Xgeneral comments and for suggesting the order of presentation of this
- Xmaterial.
- X
- X
- X--- So who are these LZWMWAPY people anyway?
- X
- XLZW stands for Lempel, Ziv, Welch. In 1977 Ziv and Lempel discovered the
- Xfirst in what are now called the LZ family of dictionary compressors.
- XThe original LZ algorithms were like LZW, but transmitted both p and c
- Xand then skipped past pc. They also started with a dictionary of just
- Xthe null string. (For detailed surveys of various LZ methods, the reader
- Xshould consult [1], [5], [7], [11].) Welch popularized the LZ variant,
- Xnow called LZW, in which the extra character was not transmitted but
- Xbecame the first character of the next phrase.
- X
- XMiller and Wegman independently discovered several LZ variants in 1984,
- Xincluding LZW and MW. In fact, Seery and Ziv had in 1978 [6] proposed an
- XLZ variant where two adjacent phrases were concatenated; this is related
- Xto MW in approximately the same way that LZ is related to LZW. (Seery
- Xand Ziv also introduced an important improvement for binary alphabets:
- Xif 01101 and 01100 are in the dictionary, there is no way that 0110 can
- Xpossibly be a longest match, so it can effectively be removed from the
- Xdictionary.)
- X
- XAP stands for ``all prefixes.'' It was originally discovered by Storer
- Xin 1988 (?) and independently rediscovered by this author in 1990.
- X
- XY actually does stand for yabbadabbadoo. The author discovered Y coding
- Xon December 26, 1990; he used yabbadabbadabbadoo as an example in
- Xexplaining the method to Woods the next day. Woods replied [10] ``I'll
- Xhave to look at your yabbadabbadoo code again to see how it differs from
- XLZR.'' The author promptly adopted ``Y coding'' as the official name.
- X(Ross Williams has suggested [12] that ``LZDB coding'' might be more
- Xappropriate.)
- X
- X
- X--- References
- X
- XYeah, yeah, I'm still writing up proper citations. XXX
- X
- X[1] Bell, Witten, Cleary, compression article, 1989
- X[2] Bernstein, yabbawhap program, 1990-1991
- X[3] Bernstein, huptrie article, 1990
- X[4] Bernstein, squeeze program, 1989
- X[5] Miller and Wegman, compression article, 1987
- X[6] Seery and Ziv, LZ78 extensions article, 1978
- X[7] Storer, compression book, 1988
- X[8] Woods et al., compress program, 1985
- X[9] Woods, private communication, 1990
- X[10] Woods, private communication, 1990
- X[11] Williams, compression book, 1991
- X[12] Williams, private communication, 1991
- END_OF_FILE
- if test 31910 -ne `wc -c <'ycoding.4b'`; then
- echo shar: \"'ycoding.4b'\" unpacked with wrong size!
- fi
- # end of 'ycoding.4b'
- fi
- echo shar: End of archive 1 \(of 4\).
- cp /dev/null ark1isdone
- MISSING=""
- for I in 1 2 3 4 ; do
- if test ! -f ark${I}isdone ; then
- MISSING="${MISSING} ${I}"
- fi
- done
- if test "${MISSING}" = "" ; then
- echo You have unpacked all 4 archives.
- rm -f ark[1-9]isdone
- else
- echo You still must unpack the following archives:
- echo " " ${MISSING}
- fi
- exit 0
- exit 0 # Just in case...
-